jiridanek (Member) commented Sep 23, 2025

Description

How Has This Been Tested?

Self checklist (all need to be checked):

  • Ensure that you have run make test (gmake on macOS) before asking for review
  • Changes to everything except Dockerfile.konflux files should be done in odh/notebooks and automatically synced to rhds/notebooks. For Konflux-specific changes, modify Dockerfile.konflux files directly in rhds/notebooks as these require special attention in the downstream repository and flow to the upcoming RHOAI release.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

Summary by CodeRabbit

  • Chores
    • Reduced CI swap allocation from 16GB to ~8GB.
    • Improves CI runner resource efficiency and stability while having no impact on application behavior or user experience.

coderabbitai bot (Contributor) commented Sep 23, 2025

Walkthrough

Changed swap_size_mb in ci/cached-builds/gha_lvm_overlay.sh from 16384 to 8196 and added a comment explaining 16GB is wasteful and 0.5GB is insufficient for arm64; no other logic, control-flow, or error-handling changes.
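
For reference, the sizes mentioned above can be checked with a little shell arithmetic. This is only a sketch: it assumes swap_size_mb is expressed in MiB (the name suggests so, but the script itself is not shown here), and gib_to_mib is a hypothetical helper, not part of gha_lvm_overlay.sh.

```shell
# Hypothetical helper (not part of gha_lvm_overlay.sh): convert GiB to MiB.
gib_to_mib() {
  echo $(( $1 * 1024 ))
}

new_mb=8196   # value introduced by this PR
echo "16 GiB = $(gib_to_mib 16) MiB"
echo "8 GiB  = $(gib_to_mib 8) MiB"
echo "${new_mb} MiB exceeds 8 GiB by $(( new_mb - $(gib_to_mib 8) )) MiB"
```

Under that assumption, the old value 16384 is exactly 16 GiB, while 8196 sits 4 MiB above an exact 8 GiB.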

Changes

Cohort / File(s) | Summary of changes
CI LVM overlay swap sizing: ci/cached-builds/gha_lvm_overlay.sh | Updated swap_size_mb from 16384 to 8196 and added a comment clarifying the size choice; surrounding logic unchanged.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

Possibly related issues

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Description Check | ⚠️ Warning | The pull request description follows the template headings but omits any detailed explanation of the changes under the Description section, leaving this critical information blank, and the self-checklist items remain unchecked, so the description is incomplete. | Please fill in the Description section with a clear summary of what was changed and why, and complete the self-checklist by marking each item as done once verified.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Docstring Coverage | ✅ Passed | No functions found in the changes. Docstring coverage check skipped.
Title Check | ✅ Passed | The pull request title clearly summarizes the primary change by stating the reduction of swap size in the gha_lvm_overlay.sh script and includes the issue reference; it is concise, specific, and directly reflects the actual modifications made.

@github-actions github-actions bot added the review-requested label (GitHub Bot creates notification on #pr-review-ai-ide-team slack channel) Sep 23, 2025
@openshift-ci openshift-ci bot added the size/xs label Sep 23, 2025
@jiridanek jiridanek linked an issue Sep 23, 2025 that may be closed by this pull request
@openshift-ci openshift-ci bot added size/xs and removed size/xs labels Sep 23, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
ci/cached-builds/gha_lvm_overlay.sh (2)

14-15: Make swap size configurable; 512MiB may be too low for source builds

512MiB risks OOM during codeserver/source compiles. Keep the lower default but allow easy override via env. Consider later aligning with Issue #2398 for dynamic sizing (≥1024MiB, ~20% VG free).

Apply this minimal change:

-# https://github.com/opendatahub-io/notebooks/issues/2404
-swap_size_mb=512
+# https://github.com/opendatahub-io/notebooks/issues/2404
+# Allow override; default intentionally small. Adjust via SWAP_SIZE_MB when needed.
+swap_size_mb="${SWAP_SIZE_MB:-512}"

Please verify in CI that memory‑heavy jobs (e.g., codeserver builds) don’t hit OOM with the new default. If they do, bump via SWAP_SIZE_MB or implement the dynamic approach from Issue #2398.
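
A minimal sketch of how such an override could be validated before use, so that a typo in SWAP_SIZE_MB does not reach lvcreate. The function name resolve_swap_size_mb is hypothetical and the validation policy (positive integers without leading zeros) is an assumption, not something the script is known to do.

```shell
# Hypothetical helper: pick the swap size from SWAP_SIZE_MB, else use the given default.
resolve_swap_size_mb() {
  default_mb="$1"
  requested="${SWAP_SIZE_MB:-$default_mb}"
  case "$requested" in
    ''|*[!0-9]*|0*)
      # Empty, non-numeric, zero, or leading-zero value: warn on stderr, use the default.
      echo "warning: ignoring invalid SWAP_SIZE_MB='${requested}'" >&2
      echo "$default_mb"
      ;;
    *)
      echo "$requested"
      ;;
  esac
}

swap_size_mb="$(resolve_swap_size_mb 512)"
echo "swap_size_mb=${swap_size_mb}"
```

A memory-heavy job could then export SWAP_SIZE_MB (for example 8192) in its workflow environment, while other jobs keep the small default.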


58-59: Harden swap creation: force mkswap and lower swappiness (per Issue #2398)

Prevents interactive prompts on stale signatures and reduces aggressive swapping on CI.

Apply:

-sudo mkswap "/dev/mapper/${VG_NAME}-swap"
-sudo swapon "/dev/mapper/${VG_NAME}-swap"
+sudo mkswap -f "/dev/mapper/${VG_NAME}-swap"
+sudo swapon "/dev/mapper/${VG_NAME}-swap"
+# Reduce swap aggressiveness on CI
+sudo sysctl -w vm.swappiness=10 || true
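
The hardening steps suggested above could be wrapped so they can be rehearsed without touching the runner. This is a sketch only: run, setup_swap, and the DRY_RUN variable are hypothetical names introduced for illustration, and the commands are the ones from the suggested diff.

```shell
# Hypothetical wrapper: print the command when DRY_RUN=1, otherwise run it via sudo.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    sudo "$@"
  fi
}

# Hypothetical helper: format and enable swap on the given device.
setup_swap() {
  swap_dev="$1"
  run mkswap -f "$swap_dev" || return 1
  run swapon "$swap_dev" || return 1
  # Lowering swappiness is best-effort; a failure here should not fail the job.
  run sysctl -w vm.swappiness=10 || true
}
```

With DRY_RUN=1 set, calling setup_swap "/dev/mapper/${VG_NAME}-swap" only prints the three commands, which makes the sequence easy to review in CI logs before enabling it for real.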
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8feb92d and 71c0b5a.

📒 Files selected for processing (1)
  • ci/cached-builds/gha_lvm_overlay.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: jiridanek
PR: opendatahub-io/notebooks#0
File: :0-0
Timestamp: 2025-09-10T21:21:02.884Z
Learning: jiridanek requested GitHub issue creation for swap creation resilience improvement in ci/cached-builds/gha_lvm_overlay.sh during PR #2356 review. Issue #2398 was created addressing lvcreate failures on small runners by implementing dynamic swap sizing (20% of VG free space, minimum 1024MiB), using mkswap -f flag to prevent signature prompts, and reducing vm.swappiness to 10 for less aggressive swapping, continuing the established pattern of systematic infrastructure improvements through detailed issue tracking.
Learnt from: jiridanek
PR: opendatahub-io/notebooks#0
File: :0-0
Timestamp: 2025-09-10T21:21:02.884Z
Learning: jiridanek requested GitHub issue creation for swap creation resilience improvement in ci/cached-builds/gha_lvm_overlay.sh during PR #2356 review. The issue addresses lvcreate failures on small runners by implementing dynamic swap sizing (20% of VG free space, minimum 1024MiB), using mkswap -f flag to prevent signature prompts, and reducing vm.swappiness to 10 for less aggressive swapping, continuing the established pattern of systematic infrastructure improvements through detailed issue tracking.
Learnt from: jiridanek
PR: opendatahub-io/notebooks#0
File: :0-0
Timestamp: 2025-07-22T07:30:12.760Z
Learning: jiridanek requested GitHub issue creation for GitHub Actions LVM setup failure during PR #1425 review, specifically addressing ext4 signature detection causing lvcreate to fail with interactive prompts. Issue #1434 was successfully created with comprehensive problem description covering stale filesystem signatures, interactive prompt failures, runner cleanup issues, disk space pressure, detailed technical analysis, multiple solution options (immediate signature wiping, robust pre-cleanup logic, enhanced runner cleanup, alternative volume strategies), clear acceptance criteria, implementation guidance, risk mitigation strategies, and proper context linking. This infrastructure issue is related to but distinct from existing Issue #1196 about fallocate failures in the same gha_lvm_overlay.sh script, continuing the established pattern of systematic infrastructure improvements through detailed issue tracking.
Learnt from: jiridanek
PR: opendatahub-io/notebooks#0
File: :0-0
Timestamp: 2025-08-05T17:24:08.616Z
Learning: jiridanek requested PR review for #1521 covering s390x architecture support improvements, demonstrating continued focus on systematic multi-architecture compatibility enhancements in the opendatahub-io/notebooks repository through clean implementation with centralized configuration, proper CI integration, and architecture-aware testing patterns.
Learnt from: jiridanek
PR: opendatahub-io/notebooks#2215
File: runtimes/datascience/ubi9-python-3.12/Dockerfile.cpu:0-0
Timestamp: 2025-09-05T11:27:31.040Z
Learning: jiridanek requested GitHub issue creation for build toolchain optimization in datascience runtime during PR #2215 review. Issue #2308 was created addressing unnecessary build dependencies (gcc-toolset-13, cmake, ninja-build, rust, cargo) in final runtime image for ppc64le architecture, covering comprehensive problem analysis with specific line numbers, multiple solution options for builder-only toolchains, clear acceptance criteria for size reduction and security improvement, detailed implementation guidance for package segregation, and proper context linking to PR #2215 review comment, continuing the established pattern of systematic infrastructure improvements through detailed issue tracking.
Learnt from: jiridanek
PR: opendatahub-io/notebooks#2265
File: .tekton/odh-workbench-jupyter-pytorch-llmcompressor-cuda-py312-ubi9-pull-request.yaml:16-16
Timestamp: 2025-09-05T12:10:28.916Z
Learning: jiridanek requested GitHub issue creation for trigger path cleanup in pytorch+llmcompressor pipeline during PR #2265 review. Issue #2310 was successfully created addressing copy-paste errors where irrelevant Minimal/DataScience trigger paths were included in the pytorch+llmcompressor pipeline on-cel-expression, causing unnecessary pipeline triggers. The issue includes comprehensive problem description covering specific irrelevant paths, detailed solution with before/after YAML code examples, clear acceptance criteria for implementation and testing, repository-wide scope consideration for similar issues, and proper context linking to PR #2265 review comment, assigned to jiridanek.
Learnt from: jiridanek
PR: opendatahub-io/notebooks#0
File: :0-0
Timestamp: 2025-09-10T21:24:07.914Z
Learning: jiridanek requested GitHub issue creation for Docker chown optimization in codeserver/ubi9-python-3.12/Dockerfile.cpu during PR #2356 review. Issue #2403 was created addressing performance impact of broad recursive chown on entire /opt/app-root directory (line 235), proposing three solution approaches: scoped chown targeting specific changed paths, root cause fix during file creation, and test modification for permission validation, with detailed benefits analysis covering layer size reduction and build time optimization, continuing the established pattern of systematic infrastructure improvements through detailed issue tracking.
Learnt from: jiridanek
PR: opendatahub-io/notebooks#2145
File: jupyter/minimal/ubi9-python-3.12/Dockerfile.cpu:70-70
Timestamp: 2025-08-27T15:00:11.931Z
Learning: jiridanek requested GitHub issue creation for uv pip sync optimization during PR #2145 review. Issue #2150 was created addressing the systematic replacement of `uv pip install --requirements=` with `uv pip sync` across 34+ Dockerfiles migrated from requirements.txt to pylock.toml. The issue includes comprehensive problem description covering drift risk and redundant flags, detailed solution with benefits analysis, clear acceptance criteria for all affected files, and proper context linking to PR #2145 review comment, assigned to jiridanek.
Learnt from: jiridanek
PR: opendatahub-io/notebooks#2217
File: .github/workflows/security.yaml:22-30
Timestamp: 2025-08-29T15:01:07.674Z
Learning: jiridanek created PR #2223 to systematically address uv setup configuration improvements (version pinning, caching, environment activation) during PR #2217 review, continuing the established pattern of implementing systematic improvements through dedicated PRs rather than applying individual point fixes during reviews.
Learnt from: jiridanek
PR: opendatahub-io/notebooks#2185
File: jupyter/pytorch/ubi9-python-3.12/Dockerfile.cuda:109-117
Timestamp: 2025-08-29T08:48:55.985Z
Learning: jiridanek prefers to implement systematic cleanup improvements through dedicated GitHub issues (like #2076) rather than applying individual point fixes during PR reviews, maintaining consistency with their established pattern of comprehensive code quality improvements.
📚 Learning: 2025-09-10T21:21:02.884Z

Applied to files:

  • ci/cached-builds/gha_lvm_overlay.sh
📚 Learning: 2025-09-10T21:21:02.884Z

Applied to files:

  • ci/cached-builds/gha_lvm_overlay.sh
📚 Learning: 2025-07-22T07:30:12.760Z

Applied to files:

  • ci/cached-builds/gha_lvm_overlay.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: code-static-analysis
  • GitHub Check: Trivy scan (fs)

atheo89 (Member) left a comment

/lgtm

openshift-ci bot (Contributor) commented Sep 23, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ide-developer
Once this PR has been reviewed and has the lgtm label, please ask for approval from atheo89. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot (Contributor) commented Sep 23, 2025

New changes are detected. LGTM label has been removed.

@openshift-ci openshift-ci bot added size/xs and removed size/xs labels Sep 23, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
ci/cached-builds/gha_lvm_overlay.sh (1)

14-16: Prefer 8192MiB (power-of-two) and add mkswap -f; consider dynamic sizing

8196 looks accidental — change to 8192. Use mkswap -f to avoid interactive signature prompts. Optional follow-up: derive swap size from VG free (≈20% with 1024MiB min) and set vm.swappiness=10 (track under #2398).

Apply within this hunk:

-# 16GB is wasteful, 0.5GB is not enough, arm64 froze with 2GB and 4GB
-# https://github.com/opendatahub-io/notebooks/issues/2404
-swap_size_mb=8196
+# 16GiB is wasteful; 0.5GiB insufficient; arm64 froze with 2GiB and 4GiB
+# https://github.com/opendatahub-io/notebooks/issues/2404
+swap_size_mb=8192

Outside this hunk:

- sudo mkswap "/dev/mapper/${VG_NAME}-swap"
+ sudo mkswap -f "/dev/mapper/${VG_NAME}-swap"
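
The optional dynamic sizing mentioned above (about 20% of free VG space with a 1024MiB minimum) could look roughly like this. It is a sketch under stated assumptions: compute_swap_mb is a hypothetical helper, and it takes the free space as a whole number of MiB rather than querying LVM itself.

```shell
# Hypothetical helper: swap size = 20% of free VG space, but at least 1024 MiB.
# Takes the free space in whole MiB as its only argument.
compute_swap_mb() {
  vg_free_mb="$1"
  min_mb=1024
  candidate=$(( vg_free_mb * 20 / 100 ))
  if [ "$candidate" -lt "$min_mb" ]; then
    candidate="$min_mb"
  fi
  echo "$candidate"
}

echo "40960 MiB free -> $(compute_swap_mb 40960) MiB swap"
echo "2048 MiB free  -> $(compute_swap_mb 2048) MiB swap"
```

The free space would probably come from something like vgs --noheadings --nosuffix --units m -o vg_free "$VG_NAME", with the fractional part stripped; that invocation is untested here. Note that on a volume group with less than 1024MiB free the minimum exceeds what lvcreate can allocate, so a real implementation would need to handle that case separately.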
📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between e442a68 and 9b2194c.

📒 Files selected for processing (1)
  • ci/cached-builds/gha_lvm_overlay.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: jiridanek
PR: opendatahub-io/notebooks#1259
File: jupyter/rocm/tensorflow/ubi9-python-3.12/kustomize/base/service.yaml:5-15
Timestamp: 2025-07-02T18:59:15.788Z
Learning: jiridanek creates targeted GitHub issues for specific test quality improvements identified during PR reviews in opendatahub-io/notebooks. Issue #1268 demonstrates this by converting a review comment about insufficient tf2onnx conversion test validation into a comprehensive improvement plan with clear acceptance criteria, code examples, and ROCm-specific context.
Learnt from: jiridanek
PR: opendatahub-io/notebooks#1909
File: runtimes/pytorch+llmcompressor/ubi9-python-3.11/Dockerfile.cuda:11-15
Timestamp: 2025-08-12T08:40:55.286Z
Learning: jiridanek requested GitHub issue creation for redundant CUDA upgrade optimization during PR #1909 review. Analysis revealed all 14 CUDA Dockerfiles contain redundant `yum upgrade -y` commands in cuda-base stages that execute after base stages already performed comprehensive `dnf upgrade` via pre-upgrade blocks, causing unnecessary CI latency and build inefficiency. Issue includes complete scope analysis with specific line numbers, investigation framework requiring NVIDIA upstream documentation review, multiple solution options, comprehensive acceptance criteria covering systematic testing and performance measurement, and proper context linking to PR #1909 review comment.
Learnt from: jiridanek
PR: opendatahub-io/notebooks#2215
File: runtimes/datascience/ubi9-python-3.12/Dockerfile.cpu:0-0
Timestamp: 2025-09-05T12:10:50.856Z
Learning: jiridanek requested GitHub issue creation for Dockerfile environment variable refactoring during PR #2215 review. Issue #2311 was created addressing build-only variables (OPENBLAS_VERSION, ONNX_VERSION, GRPC_PYTHON_BUILD_SYSTEM_OPENSSL) being unnecessarily written to /etc/profile.d/ppc64le.sh in runtimes/datascience/ubi9-python-3.12/Dockerfile.cpu, causing variable duplication across stages, unreliable sourcing in non-login build contexts, and violation of DRY principles. The issue includes comprehensive problem description covering affected lines 30-37, detailed impact analysis of build reliability and maintenance overhead, three solution options with centralized ARG/ENV approach as recommended, clear acceptance criteria for version centralization and build-only variable cleanup, and specific implementation guidance with code examples, assigned to jiridanek, continuing the established pattern of systematic infrastructure improvements through detailed issue tracking.
📚 Learning: 2025-09-10T21:21:02.884Z

Applied to files:

  • ci/cached-builds/gha_lvm_overlay.sh
📚 Learning: 2025-09-10T21:21:02.884Z

Applied to files:

  • ci/cached-builds/gha_lvm_overlay.sh
📚 Learning: 2025-07-22T07:30:12.760Z

Applied to files:

  • ci/cached-builds/gha_lvm_overlay.sh
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: code-static-analysis

openshift-ci bot (Contributor) commented Sep 23, 2025

@jiridanek: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/images | 9b2194c | link | true | /test images

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

jiridanek (Member, Author) commented

/hold could not get a clean build on gha with this

jiridanek (Member, Author) commented

@coderabbitai I've been trying various swap sizes, but I have not yet seen a successful build after reducing the swap size.

coderabbitai bot (Contributor) commented Oct 10, 2025

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash

# Get the current state of the gha_lvm_overlay.sh file
echo "=== Current gha_lvm_overlay.sh content ==="
cat ci/cached-builds/gha_lvm_overlay.sh

echo -e "\n=== Issue #2404 details ==="
gh issue view 2404 --json title,body,state,comments --jq '{title, body, state, comments: [.comments[] | {author: .author.login, body: .body}]}'

echo -e "\n=== Issue #2398 details (from learning) ==="
gh issue view 2398 --json title,body,state,comments --jq '{title, body, state, comments: [.comments[] | {author: .author.login, body: .body}]}'

echo -e "\n=== Recent PR runs ==="
gh run list --repo opendatahub-io/notebooks --limit 10 --json number,status,conclusion,createdAt,headBranch --jq '.[] | select(.headBranch == "swp")'

echo -e "\n=== PR diff ==="
gh pr diff 2529

Length of output: 14321


🏁 Script executed:

#!/bin/bash

echo "=== Checking workflow run 17941484896 ==="
gh run view 17941484896 --repo jiridanek/notebooks --json jobs,conclusion,status --jq '{conclusion, status, jobs: [.jobs[] | {name, conclusion, steps: [.steps[] | select(.conclusion == "failure") | {name, conclusion}]}]}'

echo -e "\n=== Checking workflow run 17946984207 ==="
gh run view 17946984207 --repo jiridanek/notebooks --json jobs,conclusion,status --jq '{conclusion, status, jobs: [.jobs[] | {name, conclusion, steps: [.steps[] | select(.conclusion == "failure") | {name, conclusion}]}]}'

echo -e "\n=== Get logs for failed jobs in 17941484896 ==="
gh run view 17941484896 --repo jiridanek/notebooks --log-failed | head -500

echo -e "\n=== Get logs for failed jobs in 17946984207 ==="
gh run view 17946984207 --repo jiridanek/notebooks --log-failed | head -500

echo -e "\n=== Check recent runs on swp branch ==="
gh run list --repo jiridanek/notebooks --branch swp --limit 20 --json databaseId,conclusion,name,createdAt

echo -e "\n=== Check if there are memory/swap related errors in recent logs ==="
# Get the most recent run and check for OOM or swap issues
LATEST_RUN=$(gh run list --repo jiridanek/notebooks --branch swp --limit 1 --json databaseId --jq '.[0].databaseId')
if [ -n "$LATEST_RUN" ]; then
  echo "Checking run $LATEST_RUN for memory issues..."
  gh run view "$LATEST_RUN" --repo jiridanek/notebooks --log | grep -i -E "(oom|out of memory|swap|killed|memory|freeze)" | head -100
fi

Length of output: 171188


🏁 Script executed:

#!/bin/bash

echo "=== Get detailed failure logs from run 17941484896 for rocm-jupyter-tensorflow ==="
gh run view 17941484896 --repo jiridanek/notebooks --log | \
  grep -i -B5 -A20 "rocm-jupyter-tensorflow-ubi9-python-3.12" | \
  grep -i -E "(error|fail|killed|oom|memory|swap)" | head -100

echo -e "\n=== Get logs from codeserver-ubi9 arm64 build failure in 17941484896 ==="
gh run view 17941484896 --repo jiridanek/notebooks --log | \
  grep -i -B10 -A30 "codeserver.*arm64" | \
  grep -i -E "(error|fail|killed|oom|memory|swap|freeze)" | head -100

echo -e "\n=== Check memory/swap stats during build in latest run ==="
# Get the latest run ID on swp branch
LATEST_RUN=$(gh run list --repo jiridanek/notebooks --branch swp --limit 1 --json databaseId --jq '.[0].databaseId')
echo "Latest run: $LATEST_RUN"

if [ -n "$LATEST_RUN" ]; then
  # Look for memory-related info in logs
  gh run view "$LATEST_RUN" --repo jiridanek/notebooks --log 2>/dev/null | \
    grep -i -E "swap|memory|oom|killed|freeze" | head -50
fi

echo -e "\n=== Check if there are specific OOM or memory errors in failed jobs ==="
gh run view 17946984207 --repo jiridanek/notebooks --log | \
  grep -i -E "(out of memory|oom killer|cannot allocate memory|memory exhausted)" | head -50

Length of output: 628


🏁 Script executed:

#!/bin/bash

echo "=== Get job IDs for failed jobs in run 17941484896 ==="
gh run view 17941484896 --repo jiridanek/notebooks --json jobs --jq '.jobs[] | select(.conclusion == "failure") | {name: .name, id: .databaseId, conclusion: .conclusion}' | head -20

echo -e "\n=== Check a specific failed job - codeserver arm64 ==="
# Get the job ID for codeserver-ubi9-python-3.12 arm64
JOB_ID=$(gh run view 17941484896 --repo jiridanek/notebooks --json jobs --jq '.jobs[] | select(.name | contains("codeserver-ubi9-python-3.12") and contains("arm64")) | .databaseId' | head -1)

if [ -n "$JOB_ID" ]; then
  echo "Job ID: $JOB_ID"
  gh run view --job="$JOB_ID" --repo jiridanek/notebooks --log 2>&1 | tail -200
fi

echo -e "\n=== Check for build vs test phase failures ==="
# Let's look at the summary of what failed
gh run view 17946984207 --repo jiridanek/notebooks --json jobs --jq '.jobs[] | select(.conclusion == "failure") | {name: .name, steps: [.steps[] | select(.conclusion == "failure") | .name]}'

Length of output: 3187


🏁 Script executed:

#!/bin/bash

echo "=== Check specific arm64 build that failed ==="
# The URLs were provided in the PR description
echo "Checking the referenced URLs from PR description:"
echo "Run 17941484896: https://github.com/jiridanek/notebooks/actions/runs/17941484896"
echo "Run 17946984207: https://github.com/jiridanek/notebooks/actions/runs/17946984207/job/51036494302"

echo -e "\n=== Analysis of failure patterns ==="
echo "Test failures (likely NOT swap-related):"
gh run view 17946984207 --repo jiridanek/notebooks --json jobs --jq '.jobs[] | select(.conclusion == "failure") | select(.steps[] | .conclusion == "failure") | {name: .name, failed_step: [.steps[] | select(.conclusion == "failure") | .name][0]}' 

echo -e "\n=== Builds that failed during build phase (potentially swap-related) ==="
gh run view 17946984207 --repo jiridanek/notebooks --json jobs --jq '.jobs[] | select(.conclusion == "failure") | select([.steps[] | select(.conclusion == "failure")] | length == 0) | .name'

echo -e "\n=== Check latest successful runs to compare ==="
gh run list --repo jiridanek/notebooks --branch main --limit 5 --json conclusion,databaseId,createdAt,displayTitle | jq '.[] | select(.conclusion == "success")'

echo -e "\n=== Memory usage comparison from Issue #2404 ==="
echo "Based on issue #2404, codeserver builds use ~2.1-2.2 GB heap"
echo "Current swap: 8196 MB (~8 GB)"
echo "Previous swap: 16384 MB (16 GB)"
echo "User mentioned arm64 froze with 2GB and 4GB swap"

Length of output: 2714


@jiridanek, I've analyzed the build failures and here's what I found:

Analysis of Build Failures

The failures fall into two distinct categories:

1. Test Failures (Likely NOT swap-related)

Most failures occur during test phases:

  • PyTest container tests
  • OpenShift tests
  • Image tests
  • Make failures

These failures occur after the image builds have succeeded, so they are likely unrelated to swap size.

2. ARM64 Build Failures (Potentially swap-related) ⚠️

Two ARM64 builds failed during the build phase, with no individual step reported as failed:

  • cuda-jupyter-tensorflow-ubi9-python-3.12 (arm64)
  • codeserver-ubi9-python-3.12 (arm64)

This pattern is concerning because:

  • You previously mentioned arm64 froze with 2GB and 4GB swap
  • These are the same builds that had issues before
  • The failure occurs during build, not test phase

Memory Usage Observations

Issue #2404 shows codeserver uses ~2.1-2.2 GB heap, but:

  • That's only JavaScript heap usage during the mangling phase
  • Total system memory during compilation (including C/C++ compilation, linking, etc.) could be significantly higher
  • ARM64 runners may have different memory characteristics
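To compare heap figures against what the runner actually consumes, swap usage could be sampled while the build runs. The following is a minimal sketch, not something taken from the existing scripts: the helper name `swap_used_mb` is hypothetical, and it assumes a Linux runner where `/proc/meminfo` reports `SwapTotal` and `SwapFree` in kB.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of gha_lvm_overlay.sh):
# print the amount of swap currently in use, in MiB.
# $1 = path to a meminfo-style file (defaults to /proc/meminfo).
swap_used_mb() {
  local meminfo=${1:-/proc/meminfo}
  # SwapTotal and SwapFree are reported in kB; convert the difference to MiB.
  awk '
    /^SwapTotal:/ { total = $2 }
    /^SwapFree:/  { free  = $2 }
    END { printf "%d\n", (total - free) / 1024 }
  ' "$meminfo"
}
```

Calling it periodically from a background loop during the build step (for example `while sleep 30; do echo "swap used: $(swap_used_mb) MiB"; done &`) would leave a trace in the job log showing how close each architecture gets to the configured swap size.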

Recommendations

Given your experimentation history, here are your options:

Option 1: Revert to 16GB (Conservative)

Restore the original swap_size_mb=16384 until you can identify the minimum viable size through more targeted testing.

Option 2: Implement Dynamic Sizing (from Issue #2398)

Apply the solution from Issue #2398, which caps swap at 20% of VG free space (minimum 1024 MiB). This provides safety against variations in runner disk size.
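A rough sketch of that sizing rule follows. It is an illustration under stated assumptions rather than the code from Issue #2398: the function name `compute_swap_mb` and its two arguments are hypothetical, and the free space of the volume group is assumed to have been read already as a whole number of MiB.

```shell
#!/usr/bin/env bash
# Hypothetical helper (an assumption, not the implementation from Issue #2398).
# $1 = requested swap size in MiB
# $2 = free space in the volume group in MiB (integer)
compute_swap_mb() {
  local requested_mb=$1
  local vg_free_mb=$2
  local cap_mb=$(( vg_free_mb * 20 / 100 ))  # never use more than 20% of VG free space
  local min_mb=1024                          # floor, so swap is never uselessly small
  local size_mb=$requested_mb

  if (( size_mb > cap_mb )); then
    size_mb=$cap_mb
  fi
  if (( size_mb < min_mb )); then
    size_mb=$min_mb
  fi
  echo "$size_mb"
}
```

With this shape, `swap_size_mb=$(compute_swap_mb 8196 "$vg_free_mb")` would keep the requested 8196 MB on a runner with plenty of free space and shrink it on smaller runners. Note that the 1024 MiB floor can exceed the 20% cap on very small volume groups, so `lvcreate` could still fail there.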

Option 3: Test Intermediate Sizes

Try 12GB (12288 MB) as a middle ground between 8GB and 16GB, specifically monitoring ARM64 builds.

Option 4: Architecture-Specific Swap

Consider different swap sizes for different architectures:

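# BUILD_ARCH is assumed to hold the target platform string (e.g. "linux/arm64");
# adjust the variable name to whatever the workflow actually exports.
if [[ "${BUILD_ARCH:-}" == *"arm64"* ]]; then
  swap_size_mb=12288  # or 16384
else
  swap_size_mb=8196
fi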

Would you like me to help implement any of these options, or would you prefer to gather more data first by checking specific ARM64 build logs?


🧠 Learnings used
Learnt from: jiridanek
PR: opendatahub-io/notebooks#0
File: :0-0
Timestamp: 2025-09-10T21:21:02.884Z
Learning: jiridanek requested GitHub issue creation for swap creation resilience improvement in ci/cached-builds/gha_lvm_overlay.sh during PR #2356 review. Issue #2398 was created addressing lvcreate failures on small runners by implementing dynamic swap sizing (20% of VG free space, minimum 1024MiB), using mkswap -f flag to prevent signature prompts, and reducing vm.swappiness to 10 for less aggressive swapping, continuing the established pattern of systematic infrastructure improvements through detailed issue tracking.

Learnt from: jiridanek
PR: opendatahub-io/notebooks#0
File: :0-0
Timestamp: 2025-09-10T21:21:02.884Z
Learning: jiridanek requested GitHub issue creation for swap creation resilience improvement in ci/cached-builds/gha_lvm_overlay.sh during PR #2356 review. The issue addresses lvcreate failures on small runners by implementing dynamic swap sizing (20% of VG free space, minimum 1024MiB), using mkswap -f flag to prevent signature prompts, and reducing vm.swappiness to 10 for less aggressive swapping, continuing the established pattern of systematic infrastructure improvements through detailed issue tracking.

Learnt from: jiridanek
PR: opendatahub-io/notebooks#0
File: :0-0
Timestamp: 2025-08-05T17:24:08.616Z
Learning: jiridanek requested PR review for #1521 covering s390x architecture support improvements, demonstrating continued focus on systematic multi-architecture compatibility enhancements in the opendatahub-io/notebooks repository through clean implementation with centralized configuration, proper CI integration, and architecture-aware testing patterns.

Learnt from: jiridanek
PR: opendatahub-io/notebooks#0
File: :0-0
Timestamp: 2025-07-22T07:30:12.760Z
Learning: jiridanek requested GitHub issue creation for GitHub Actions LVM setup failure during PR #1425 review, specifically addressing ext4 signature detection causing lvcreate to fail with interactive prompts. Issue #1434 was successfully created with comprehensive problem description covering stale filesystem signatures, interactive prompt failures, runner cleanup issues, disk space pressure, detailed technical analysis, multiple solution options (immediate signature wiping, robust pre-cleanup logic, enhanced runner cleanup, alternative volume strategies), clear acceptance criteria, implementation guidance, risk mitigation strategies, and proper context linking. This infrastructure issue is related to but distinct from existing Issue #1196 about fallocate failures in the same gha_lvm_overlay.sh script, continuing the established pattern of systematic infrastructure improvements through detailed issue tracking.

Labels
do-not-merge/hold do-not-merge/work-in-progress review-requested GitHub Bot creates notification on #pr-review-ai-ide-team slack channel size/xs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Reevaluate memory settings for code-server builds
3 participants